Exploiting Phrasal Lexica and Additional Morpho-syntactic Language Resources for Statistical Machine Translation with Scarce Training Data
Abstract
This work proposes the use of a phrasal lexicon for statistical machine translation and analyzes the relation between data acquisition costs and translation quality for different types and sizes of language resources. The language pairs are Spanish-English and Catalan-English, and translation is performed in all directions. The phrasal lexicon is used both to augment and to replace the original training corpus. For translation with scarce training material, the augmentation of the phrasal lexicon with additional monolingual language resources containing morpho-syntactic information is also investigated. With the augmented phrasal lexicon as additional training data, a reasonable translation quality can be achieved with only 1000 sentence pairs from the desired domain.
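The two uses of the phrasal lexicon named in the abstract (adding it to the training corpus, or using it in place of the corpus) can be sketched roughly as below. This is only an illustration of the data preparation step, not the authors' pipeline: the function name `build_training_data`, the `replace` flag, and the example sentence and lexicon entries are all hypothetical, and the morpho-syntactic expansion of the lexicon is not modeled because the abstract does not describe how it is done.

```python
from typing import List, Tuple

# A training item is a (source, target) pair; a phrasal lexicon entry
# has the same shape, only shorter. Both names are assumptions.
Pair = Tuple[str, str]


def build_training_data(corpus: List[Pair],
                        lexicon: List[Pair],
                        replace: bool = False) -> List[Pair]:
    """Combine a small parallel corpus with a phrasal lexicon.

    replace=False: lexicon entries are appended to the corpus.
    replace=True:  the lexicon alone is used as training data.
    """
    if replace:
        return list(lexicon)
    return list(corpus) + list(lexicon)


# Made-up Spanish-English data, for illustration only.
small_corpus: List[Pair] = [
    ("¿dónde está la estación?", "where is the station?"),
]
phrasal_lexicon: List[Pair] = [
    ("buenos días", "good morning"),
    ("por favor", "please"),
]

augmented = build_training_data(small_corpus, phrasal_lexicon)
lexicon_only = build_training_data(small_corpus, phrasal_lexicon, replace=True)
```

Here `augmented` holds the corpus sentence followed by the two lexicon entries, while `lexicon_only` holds just the lexicon entries. Either list would then be passed to whatever alignment and phrase extraction tooling is in use.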
Similar resources
Machine translation: statistical approach with additional linguistic knowledge
In this thesis, three possible aspects of using linguistic (i.e. morpho-syntactic) knowledge for statistical machine translation are described: the treatment of syntactic differences between source and target language using source POS tags, statistical machine translation with a small amount of bilingual training data, and automatic error analysis of translation output. Reorderings in the sourc...
Statistical Machine Translation with Scarce Resources Using Morpho-syntactic Information
In statistical machine translation, correspondences between the words in the source and the target language are learned from parallel corpora, and often little or no linguistic knowledge is used to structure the underlying models. In particular, existing statistical systems for machine translation often treat different inflected forms of the same lemma as if they were independent of one another...
Augmenting a Small Parallel Text with Morpho-syntactic Language Resources for Serbian-English Statistical Machine Translation
In this work, we examine the quality of several statistical machine translation systems constructed on a small amount of parallel Serbian-English text. The main bilingual parallel corpus consists of about 3k sentences and 20k running words from an unrestricted domain. The translation systems are built on the full corpus as well as on a reduced corpus containing only 200 parallel sentences. A sm...
Improving Phrase-Based SMT with Morpho-Syntactic Analysis and Transformation
This paper presents our study of exploiting morpho-syntactic information for phrase-based statistical machine translation (SMT). For morphological transformation, we use hand-crafted transformational rules. For syntactic transformation, we propose a transformational model based on Bayes’ formula. The model is trained using a bilingual corpus and a broad coverage parser of the source language. T...
Dealing with Sign Language Morphemes in Statistical Machine Translation
The aim of this research is to establish the role of linguistic information in data-scarce statistical machine translation for sign languages using freely available tools. The main challenge in statistical machine translation is the scarcity of suitable data, and this problem becomes more pronounced in sign languages. The available corpora are small, usually not domain-specific, and their annot...